A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
Estimating set similarity and detecting highly similar sets are fundamental
problems in areas such as databases, machine learning, and information
retrieval. MinHash is a well-known technique for approximating Jaccard
similarity of sets and has been successfully used for many applications such as
similarity search and large scale learning. Its two compressed versions, b-bit
MinHash and Odd Sketch, can significantly reduce the memory usage of the
original MinHash method, especially for estimating high similarities (i.e.,
similarities around 1). Although MinHash can be applied to static sets as well
as to streaming sets, whose elements arrive in a streaming fashion and whose
cardinality is unknown or even infinite, b-bit MinHash and Odd
Sketch cannot handle streaming data. To solve this problem, we design a
memory-efficient sketch method, MaxLogHash, to accurately estimate Jaccard
similarities in streaming sets. Compared to MinHash, our method uses smaller
registers (each register consists of fewer than 7 bits) to build a compact
sketch for each set. We also provide a simple yet accurate estimator for
inferring Jaccard similarity from MaxLogHash sketches. In addition, we derive
formulas for bounding the estimation error and determine the smallest necessary
memory usage (i.e., the number of registers used for a MaxLogHash sketch) for
the desired accuracy. We conduct experiments on a variety of datasets, and
the results show that MaxLogHash is about 5 times more
memory efficient than MinHash at the same accuracy and computational cost when
estimating high similarities.
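To make the baseline concrete, here is a minimal sketch of plain MinHash applied to a stream, the method the abstract compares against. It is not MaxLogHash: the class name `StreamingMinHash`, the helper `_hash`, and the choice of BLAKE2b as the hash family are assumptions made for illustration only.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Hypothetical helper: 64-bit hash of `item` under hash function `seed`."""
    digest = hashlib.blake2b(
        item.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


class StreamingMinHash:
    """Plain MinHash sketch with k full-width registers (illustrative only)."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        # Start above every possible 64-bit hash value.
        self.registers = [2**64] * k

    def add(self, item: str) -> None:
        """Process one stream element; duplicates leave the sketch unchanged."""
        for j in range(self.k):
            h = _hash(j, item)
            if h < self.registers[j]:
                self.registers[j] = h

    def jaccard(self, other: "StreamingMinHash") -> float:
        """Fraction of matching registers, an unbiased Jaccard estimate."""
        matches = sum(a == b for a, b in zip(self.registers, other.registers))
        return matches / self.k
```

Each register here holds a full 64-bit minimum, which is the memory cost that compressed sketches aim to reduce. The sketch never needs the set's cardinality in advance, so it can be updated one element at a time.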
Improved Densification of One Permutation Hashing
The existing work on densification of one permutation hashing reduces the
query processing cost of the $(K,L)$-parameterized Locality Sensitive Hashing
(LSH) algorithm with minwise hashing, from $O(dKL)$ to merely $O(d + KL)$,
where $d$ is the number of nonzeros of the data vector, $K$ is the number of
hashes in each hash table, and $L$ is the number of hash tables. While that is
a substantial improvement, our analysis reveals that the existing densification
scheme is sub-optimal. In particular, there is not enough randomness in that
procedure, which affects its accuracy on very sparse datasets.
In this paper, we provide a new densification procedure which is provably
better than the existing scheme. This improvement is more significant for very
sparse datasets, which are common on the web. The improved technique has the
same cost of $O(d + KL)$ for query processing, thereby making it strictly
preferable over the existing procedure. Experimental evaluations on public
datasets, in the task of hashing-based near neighbor search, support our
theoretical findings.
Improved Asymmetric Locality Sensitive Hashing (ALSH) for Maximum Inner Product Search (MIPS)
Recently it was shown that the problem of Maximum Inner Product Search (MIPS)
can be solved efficiently and admits provably sub-linear hashing algorithms. Asymmetric
transformations before hashing were the key to solving MIPS, which was otherwise
hard. In the prior work, the authors use asymmetric transformations that
convert the problem of approximate MIPS into the problem of approximate near
neighbor search, which can be efficiently solved using hashing. In this work, we
provide a different transformation that converts the problem of approximate
MIPS into the problem of approximate cosine similarity search, which can be
efficiently solved using signed random projections. Theoretical analysis shows
that the new scheme is significantly better than the original scheme for MIPS.
Experimental evaluations strongly support the theoretical findings.
Comment: arXiv admin note: text overlap with arXiv:1405.586
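The cosine-similarity step mentioned in this abstract can be illustrated with a minimal sketch of signed random projections (SRP). It covers only the hashing side and does not reproduce the paper's asymmetric transformation. The function names `make_planes`, `srp_signature` and `estimate_cosine` are assumptions for illustration. The estimator relies on the standard SRP property that two vectors at angle theta agree on a random sign bit with probability 1 - theta/pi.

```python
import math
import random


def make_planes(dim: int, n_bits: int, seed: int = 0):
    """Draw n_bits random Gaussian directions in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def srp_signature(vec, planes):
    """One bit per direction: the sign of the projection of `vec`."""
    return [
        1 if sum(p_i * v_i for p_i, v_i in zip(plane, vec)) >= 0 else 0
        for plane in planes
    ]


def estimate_cosine(sig_a, sig_b) -> float:
    """Invert Pr[bits agree] = 1 - theta/pi to recover cos(theta)."""
    agree = sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
    theta = math.pi * (1.0 - agree)
    return math.cos(theta)
```

A typical use is to build the directions once with `make_planes`, hash every data vector with `srp_signature`, and compare signatures instead of the original vectors. More bits give a tighter estimate at the cost of longer signatures.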
In Defense of MinHash Over SimHash
MinHash and SimHash are the two widely adopted Locality Sensitive Hashing
(LSH) algorithms for large-scale data processing applications. Deciding which
LSH to use for a particular problem at hand is an important question, which has
no clear answer in the existing literature. In this study, we provide a
theoretical answer (validated by experiments) that MinHash virtually always
outperforms SimHash when the data are binary, as is common in practice, for
example in search.
The collision probability of MinHash is a function of resemblance similarity
($\mathcal{R}$), while the collision probability of SimHash is a function of
cosine similarity ($\mathcal{S}$). To provide a common basis for comparison, we
evaluate retrieval results in terms of $\mathcal{S}$ for both MinHash and
SimHash. This evaluation is valid as we can prove that MinHash is a valid LSH
with respect to $\mathcal{S}$, by using a general inequality
$\mathcal{S}^2 \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$. Our
worst-case analysis shows that MinHash significantly outperforms SimHash in the
high similarity region.
Interestingly, our intensive experiments reveal that MinHash is also
substantially better than SimHash even in datasets where most of the data
points are not too similar to each other. This is partly because, in practical
data, $\mathcal{R} \geq \frac{\mathcal{S}}{z-\mathcal{S}}$ often holds, where $z$
is only slightly larger than 2 (e.g., $z \leq 2.1$). Our restricted worst-case
analysis, assuming
$\frac{\mathcal{S}}{z-\mathcal{S}} \leq \mathcal{R} \leq \frac{\mathcal{S}}{2-\mathcal{S}}$,
shows that MinHash indeed significantly outperforms SimHash even in the low
similarity region.
We believe the results in this paper will provide valuable guidelines for
search in practice, especially when the data are sparse.
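For binary data viewed as sets, resemblance is R = |A ∩ B| / |A ∪ B| and cosine similarity is S = |A ∩ B| / sqrt(|A| |B|). From these definitions one can show that S^2 <= R <= S / (2 - S), which is the kind of relation that lets the two measures be compared on a common basis. The following minimal sketch checks that bound numerically; the function names are assumptions for illustration, and the sets are assumed non-empty.

```python
import math


def resemblance(a: set, b: set) -> float:
    """Jaccard (resemblance) similarity R of two non-empty sets."""
    return len(a & b) / len(a | b)


def cosine_binary(a: set, b: set) -> float:
    """Cosine similarity S of the binary indicator vectors of a and b."""
    return len(a & b) / math.sqrt(len(a) * len(b))


def within_bounds(a: set, b: set, eps: float = 1e-12) -> bool:
    """Check S^2 <= R <= S / (2 - S), with a small float tolerance."""
    r = resemblance(a, b)
    s = cosine_binary(a, b)
    # S <= 1, so the denominator 2 - S is always at least 1.
    return s * s - eps <= r <= s / (2.0 - s) + eps
```

The upper bound is attained when the two sets have equal size, and the lower bound when one set contains the other.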
Graph Kernels via Functional Embedding
We propose a representation of a graph as a functional object derived from the
power iteration of the underlying adjacency matrix. The proposed functional
representation is a graph invariant, i.e., the functional remains unchanged
under any reordering of the vertices. This property eliminates the difficulty
of handling exponentially many isomorphic forms. A Bhattacharyya kernel
constructed between these functionals significantly outperforms the
state-of-the-art graph kernels on 3 out of the 4 standard benchmark graph
classification datasets, demonstrating the superiority of our approach. The
proposed methodology is simple and runs in time linear in the number of edges,
which makes our kernel more efficient and scalable compared to many widely
adopted graph kernels with running time cubic in the number of vertices.